AITopics | tabular mdp

We study offline-online reinforcement learning in linear mixture Markov decision processes (MDPs) under environment shift. In the offline phase, data are collected by an unknown behavior policy and may come from a mismatched environment, while in the online phase the learner interacts with the target environment. We propose an algorithm that adaptively leverages offline data. When the offline data are informative, either due to sufficient coverage or small environment shift, the algorithm provably improves over purely online learning. When the offline data are uninformative, it safely ignores them and matches the online-only performance. We establish regret upper bounds that explicitly characterize when offline data are beneficial, together with nearly matching lower bounds. Numerical experiments further corroborate our theoretical findings.
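For reference, the linear mixture model named in the abstract is usually written as below. This is the standard textbook form, not taken from the paper itself; the symbols $\phi$, $\theta^\star$, and $d$ are assumed notation.

```latex
\[
P(s' \mid s, a) \;=\; \big\langle \phi(s' \mid s, a),\, \theta^\star \big\rangle,
\qquad \theta^\star \in \mathbb{R}^d \ \text{unknown},
\]
% \phi is a known feature map; the learner estimates \theta^\star from observed transitions.
```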

machine learning, reinforcement learning, zhangandsinclair, (20 more...)

arXiv.org Machine Learning

2604.11994

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.49)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.48)
(2 more...)

Reinforcement learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward

Neural Information Processing Systems | Feb-17-2026, 21:20:54 GMT

The latter case can be further reduced to an adversarial MDP when preferences depend only on the final state.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.04)
Europe > France (0.04)
Europe > Austria (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)

62a9c80248963f348778a9c0bec060dd-Paper-Conference.pdf

Neural Information Processing Systems | Feb-15-2026, 10:40:31 GMT

algorithm, mdp, reward function, (15 more...)

Neural Information Processing Systems

Country:

Europe > Italy > Lombardy > Milan (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Middle East > Israel > Haifa District > Haifa (0.04)

Genre: Research Report > Experimental Study (0.92)

Industry: Education > Educational Setting > Online (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

b733cdd80ed2ae7e3156d8c33108c5d5-Supplemental-Conference.pdf

Neural Information Processing Systems | Feb-11-2026, 13:17:16 GMT

bayesian regret, information ratio, mdp, (14 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.46)

b733cdd80ed2ae7e3156d8c33108c5d5-Paper-Conference.pdf

Neural Information Processing Systems | Feb-11-2026, 13:17:12 GMT

bayesian regret, information ratio, mdp, (12 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.46)

0b13c22ca208bc08f3fd13793292f25f-Paper-Conference.pdf

Neural Information Processing Systems | Feb-7-2026, 16:46:25 GMT

algorithm, policy optimization algorithm, sample complexity, (9 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
North America > Canada > Alberta (0.14)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.88)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs

Neural Information Processing Systems | Dec-25-2025, 00:48:21 GMT

This paper establishes that optimistic algorithms attain gap-dependent, non-asymptotic logarithmic regret for episodic MDPs. In contrast to prior work, our bounds do not depend on diameter-like quantities or ergodicity, and they smoothly interpolate between the gap-dependent logarithmic regret and the $\widetilde{\mathcal{O}}(\sqrt{HSAT})$ minimax rate. The key technique in our analysis is a novel "clipped" regret decomposition that applies to a broad family of recent optimistic algorithms for episodic MDPs.
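The abstract does not spell out the "gap" these bounds depend on. In the usual episodic convention it is the sub-optimality gap of an action at a given step and state; the symbols $V^\star_h$, $Q^\star_h$, and $\mathrm{gap}_h$ below are assumed notation, not quoted from the paper.

```latex
\[
\mathrm{gap}_h(s, a) \;=\; V^\star_h(s) - Q^\star_h(s, a) \;\ge\; 0,
\qquad
V^\star_h(s) \;=\; \max_{a'} Q^\star_h(s, a').
\]
% Optimal actions have gap 0; larger gaps make suboptimal actions easier to rule out.
```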

name change, non-asymptotic gap-dependent regret bound, tabular mdp, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.88)

Nearly Horizon-Free Offline Reinforcement Learning

Neural Information Processing Systems | Dec-24-2025, 09:37:25 GMT

We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes (MDPs). For a tabular MDP with $S$ states and $A$ actions, or a linear MDP with anchor points and feature dimension $d$, given $K$ episodes of collected data with minimum visiting probability $d_m$ of (anchor) state-action pairs, we obtain nearly horizon $H$-free sample complexity bounds for offline reinforcement learning when the total reward is upper bounded by 1. Specifically: For offline policy evaluation, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}} \right)$ error bound for the plug-in estimator, which matches the lower bound up to logarithmic factors and has no additional dependency on $\mathrm{poly}(H, S, A, d)$ in the higher-order term. For offline policy optimization, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}} + \frac{\min(S, d)}{Kd_m}\right)$ sub-optimality gap for the empirical optimal policy, which approaches the lower bound up to logarithmic factors and a higher-order term, improving upon the best known result by [Cui and Yang 2020], which has additional $\mathrm{poly} (H, S, d)$ factors in the main term. To the best of our knowledge, these are the first nearly horizon-free bounds for episodic time-homogeneous offline tabular MDPs and linear MDPs with anchor points. Central to our analysis is a simple yet effective recursion-based method to bound a total variance term in the offline scenarios, which could be of independent interest.
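The plug-in estimator mentioned above can be illustrated with a minimal sketch for the tabular, time-homogeneous case: estimate transitions and rewards by empirical averages, then evaluate the policy in the estimated model by backward induction. This is a generic construction rather than the paper's code. The function name `plug_in_policy_value`, its arguments, and the handling of unvisited state-action pairs (zero reward, no continuation value) are assumptions made for illustration.

```python
import numpy as np

def plug_in_policy_value(episodes, policy, S, A, H, init_state=0):
    """Plug-in (model-based) estimate of a policy's H-step value.

    episodes: list of trajectories, each a list of (s, a, r, s_next) tuples.
    policy:   integer array of shape (H, S); policy[h, s] is the action
              taken at step h in state s.
    """
    counts = np.zeros((S, A, S))
    reward_sum = np.zeros((S, A))
    for trajectory in episodes:
        for s, a, r, s_next in trajectory:
            counts[s, a, s_next] += 1
            reward_sum[s, a] += r

    visits = counts.sum(axis=2)            # n(s, a)
    safe = np.maximum(visits, 1)           # avoid division by zero
    # Assumption: unvisited pairs keep an all-zero row and zero reward.
    P_hat = counts / safe[:, :, None]      # empirical transition model
    r_hat = reward_sum / safe              # empirical mean reward

    V = np.zeros(S)
    for h in reversed(range(H)):
        Q = r_hat + P_hat @ V              # shape (S, A)
        V = Q[np.arange(S), policy[h]]     # follow the evaluated policy
    return V[init_state]
```

The same stationary model `P_hat` is reused at every step, which matches the time-homogeneous setting described in the abstract; a time-inhomogeneous variant would keep separate counts per step.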

electronic proceedings, horizon-free offline reinforcement learning, name change, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.52)


efb9629755e598c4f261c44aeb6fde5e-Paper-Conference.pdf

Offline-Online Reinforcement Learning for Linear Mixture MDPs
